1 Introduction

1.1 Description of Dataset

Our dataset originates from Kaggle. It contains diabetes-related health indicators from the Behavioral Risk Factor Surveillance System (BRFSS), a health-related telephone survey conducted annually by the American Centers for Disease Control and Prevention (CDC). The dataset originally comprised 330 features (columns), but the publisher reduced it, based on research into risk factors for diabetes and other chronic health conditions, to 21 features plus the target variable Diabetes_binary. Here, we use the data from the year 2015, which contains 70,692 observations. See Table 1 below for a detailed description of each variable.

Table 1. Description of variables
Variable Description Coding
Diabetes_binary Respondent has diabetes 0 = no, 1 = yes
HighBP Respondent has high blood pressure 0 = no, 1 = yes
HighChol Respondent has high cholesterol 0 = no, 1 = yes
CholCheck Respondent had cholesterol check in 5 years 0 = no, 1 = yes
BMI Body Mass Index of respondent unit = \(kg/m^2\)
Smoker Respondent smoked at least 100 cigarettes 0 = no, 1 = yes
Stroke Respondent had a stroke 0 = no, 1 = yes
HeartDiseaseorAttack Respondent had coronary heart disease or myocardial infarction 0 = no, 1 = yes
PhysActivity Respondent had physical activity in past 30 days 0 = no, 1 = yes
Fruits Respondent consumes fruit 1 or more times per day 0 = no, 1 = yes
Veggies Respondent consumes vegetables 1 or more times per day 0 = no, 1 = yes
HvyAlcoholConsump Respondent consumes high amounts of alcohol 0 = no, 1 = yes
AnyHealthcare Respondent has any kind of health care coverage 0 = no, 1 = yes
NoDocbcCost Respondent needed to see a doctor but could not because of cost 0 = no, 1 = yes
GenHlth Respondent’s general health 1 = excellent, 2 = very good, 3 = good, 4 = fair, 5 = poor
MentHlth Number of days of poor mental health in past 30 days scale 1-30 days
PhysHlth Number of days of physical illness or injury in past 30 days scale 1-30 days
DiffWalk Respondent has serious difficulty walking or climbing stairs 0 = no, 1 = yes
Sex Sex of respondent 0 = female, 1 = male
Age 13-level age category 1 = Age 18 to 24, 2 = Age 25 to 29, 3 = Age 30 to 34, 4 = Age 35 to 39, 5 = Age 40 to 44, 6 = Age 45 to 49, 7 = Age 50 to 54, 8 = Age 55 to 59, 9 = Age 60 to 64, 10 = Age 65 to 69, 11 = Age 70 to 74, 12 = Age 75 to 79, 13 = Age 80 or older
Education 6-level education level category 1 = Never attended school or only kindergarten, 2 = Grades 1 through 8 (Elementary), 3 = Grades 9 through 11 (Some high school), 4 = Grade 12 or GED (High school graduate), 5 = College 1 year to 3 years (Some college or technical school), 6 = College 4 years or more (College graduate)
Income 8-level income category 1 = Less than $10,000, 2 = Less than $15,000 ($10,000 to less than $15,000), 3 = Less than $20,000 ($15,000 to less than $20,000), 4 = Less than $25,000 ($20,000 to less than $25,000), 5 = Less than $35,000 ($25,000 to less than $35,000), 6 = Less than $50,000 ($35,000 to less than $50,000), 7 = Less than $75,000 ($50,000 to less than $75,000), 8 = $75,000 or more

1.2 Goal

The goal of this analysis is twofold. First, we aim to build classification models that predict whether a person has diabetes based on their health indicators. To that end, we present two models, a relatively simple one and a more complex one, and evaluate their individual and comparative performance. Second, we want to gain insight into which of the health indicators (predictors) in the dataset are associated with developing diabetes, the so-called risk factors.

1.3 Load Packages & Import Data

We load the relevant packages and import the diabetes dataset.

## load packages
library(magrittr)       # for using pipes
library(ggplot2)        # for creating plots
library(purrr)          # for using map function
library(ggpubr)         # for using ggarrange function
library(caret)          # for general model fitting and evaluation
library(knitr)          # for kable tables
library(kableExtra)     # for nice tables
library(gtsummary)      # for creating a table split by group
library(tidyverse)      # for data wrangling
library(devtools)       # source the r script from a github repo
library(xgboost)        # for xgboost modeling
library(RColorBrewer)   # for color specification in ggplots
library(lm.beta)        # for standardized coefficients
library(plotROC)        # for using geom_roc()
library(pROC)           # for creating roc curves
library(pander)         # for quick table building
## load the data
diabetes <- read.csv("Data/Data.csv")

1.4 Set Seed & Recode Variable

We set the seed for reproducibility.

## set the seed
set.seed(123)

Some variables are stored with the wrong type, so we recode them as factors.

## recode variables
diabetes %<>% mutate(across(-c(BMI, Age, PhysHlth, MentHlth), as.factor))

2 Data Inspection

2.1 Descriptive Statistics

## checking for the presence of missing values
any(is.na(diabetes)) 
[1] FALSE

There are no missing values in the dataset.

## summary statistics table using the gtsummary table
diabetes %>% 
   # rearrange the variables such that factor comes after numeric variables 
   relocate(where(is.factor), .after = where(is.numeric)) %>% 
   mutate(
      across(
         # recode the factor variable levels
         c(Diabetes_binary, HighBP, HighChol, CholCheck, Smoker, Stroke, 
               HeartDiseaseorAttack, PhysActivity, Fruits, Veggies, 
               HvyAlcoholConsump, AnyHealthcare, NoDocbcCost, DiffWalk),
         ~ fct_recode(.x, "Yes " = "1", "No " = "0")),
         Sex = fct_recode(Sex, "Male" = "1", "Female" = "0"),
         GenHlth = fct_recode(
            GenHlth, "excellent" = "1","very good" =  "2", "good" = "3", 
            "fair" ="4", "poor" = "5"),
         Education = fct_recode(
            Education, "Never attended" = "1", "Elementary" = "2", 
            "Some high school" =  "3", "High school graduate" = "4" , 
            "Some college" = "5", "College graduate" = "6"),
         Income = fct_recode(
            Income, "< 10,000" = "1", "10,000 - 15,000" = "2", 
            "15,000 - 20,000" = "3", "20,000 - 25,000" = "4", 
            "25,000 - 35,000" = "5", "35,000 - 50,000" = "6", 
            "50,000 - 75,000" = "7", "> 75,000" = "8")) %>% 
   tbl_summary(
      # show summary statistics by diabetes group
      by = Diabetes_binary,
      type = c(BMI, MentHlth, PhysHlth, Age) ~ "continuous",
      label = list(
         # description of each variable
         c(Age) ~ "Age (see above)",
         c(GenHlth) ~ "General Health",
         c(MentHlth) ~ "Mental Health",
         c(HighBP) ~ "High Blood Pressure",
         c(HighChol) ~ "High Cholesterol",
         c(CholCheck) ~ "Cholesterol Check in 5 Years",
         c(HeartDiseaseorAttack) ~ "Coronary Heart Disease or Myocardial Infarction",
         c(PhysActivity) ~ "Physical Activity in Past 30 Days",
         c(Fruits) ~ "Consumes Fruit 1 Or More Times per Day",
         c(Veggies) ~ "Consumes Vegetables 1 Or More Times per Day",
         c(DiffWalk) ~ "Serious Difficulty Walking",
         c(HvyAlcoholConsump) ~ "Consumes High Amounts of Alcohol",
         c(AnyHealthcare) ~ "Health Care Coverage",
         c(NoDocbcCost) ~ "Could Not See Doctor Because of Cost",
         c(PhysHlth) ~ "Physical Health"),
      # specify the summary statistics
      statistic = list(
         all_continuous() ~ "{mean}, {median} ({sd})",
         all_categorical() ~ "{n} ({p}%)")) %>%
      modify_footnote(update = everything() ~ NA) %>%
   add_overall() %>% 
   bold_labels() %>% 
   # modify the table formatting
   modify_header(label ~ "**Variable**") %>% 
   modify_spanning_header(c("stat_1", "stat_2") ~ "**Diabetes**") %>% 
   modify_caption(caption = "Summary Table") %>% 
   as_kable_extra() %>% 
   kable_classic(bootstrap_options = "striped", full_width = TRUE) %>% 
   row_spec(0, bold = TRUE)
Table 2.1: Summary Table
Diabetes
Variable Overall, N = 70,692 No , N = 35,346 Yes , N = 35,346
BMI 30, 29 (7) 28, 27 (6) 32, 31 (7)
Mental Health 4, 0 (8) 3, 0 (7) 4, 0 (9)
Physical Health 6, 0 (10) 4, 0 (8) 8, 1 (11)
Age (see above) 8.6, 9.0 (2.9) 7.8, 8.0 (3.1) 9.4, 10.0 (2.3)
High Blood Pressure
No 30,860 (44%) 22,118 (63%) 8,742 (25%)
Yes 39,832 (56%) 13,228 (37%) 26,604 (75%)
High Cholesterol
No 33,529 (47%) 21,869 (62%) 11,660 (33%)
Yes 37,163 (53%) 13,477 (38%) 23,686 (67%)
Cholesterol Check in 5 Years
No 1,749 (2.5%) 1,508 (4.3%) 241 (0.7%)
Yes 68,943 (98%) 33,838 (96%) 35,105 (99%)
Smoker
No 37,094 (52%) 20,065 (57%) 17,029 (48%)
Yes 33,598 (48%) 15,281 (43%) 18,317 (52%)
Stroke
No 66,297 (94%) 34,219 (97%) 32,078 (91%)
Yes 4,395 (6.2%) 1,127 (3.2%) 3,268 (9.2%)
Coronary Heart Disease or Myocardial Infarction
No 60,243 (85%) 32,775 (93%) 27,468 (78%)
Yes 10,449 (15%) 2,571 (7.3%) 7,878 (22%)
Physical Activity in Past 30 Days
No 20,993 (30%) 7,934 (22%) 13,059 (37%)
Yes 49,699 (70%) 27,412 (78%) 22,287 (63%)
Consumes Fruit 1 Or More Times per Day
No 27,443 (39%) 12,790 (36%) 14,653 (41%)
Yes 43,249 (61%) 22,556 (64%) 20,693 (59%)
Consumes Vegetables 1 Or More Times per Day
No 14,932 (21%) 6,322 (18%) 8,610 (24%)
Yes 55,760 (79%) 29,024 (82%) 26,736 (76%)
Consumes High Amounts of Alcohol
No 67,672 (96%) 33,158 (94%) 34,514 (98%)
Yes 3,020 (4.3%) 2,188 (6.2%) 832 (2.4%)
Health Care Coverage
No 3,184 (4.5%) 1,762 (5.0%) 1,422 (4.0%)
Yes 67,508 (95%) 33,584 (95%) 33,924 (96%)
Could Not See Doctor Because of Cost
No 64,053 (91%) 32,449 (92%) 31,604 (89%)
Yes 6,639 (9.4%) 2,897 (8.2%) 3,742 (11%)
General Health
excellent 8,282 (12%) 7,142 (20%) 1,140 (3.2%)
very good 19,872 (28%) 13,491 (38%) 6,381 (18%)
good 23,427 (33%) 9,970 (28%) 13,457 (38%)
fair 13,303 (19%) 3,513 (9.9%) 9,790 (28%)
poor 5,808 (8.2%) 1,230 (3.5%) 4,578 (13%)
Serious Difficulty Walking
No 52,826 (75%) 30,601 (87%) 22,225 (63%)
Yes 17,866 (25%) 4,745 (13%) 13,121 (37%)
Sex
Female 38,386 (54%) 19,975 (57%) 18,411 (52%)
Male 32,306 (46%) 15,371 (43%) 16,935 (48%)
Education
Never attended 75 (0.1%) 28 (<0.1%) 47 (0.1%)
Elementary 1,647 (2.3%) 464 (1.3%) 1,183 (3.3%)
Some high school 3,447 (4.9%) 1,151 (3.3%) 2,296 (6.5%)
High school graduate 19,473 (28%) 8,407 (24%) 11,066 (31%)
Some college 20,030 (28%) 9,676 (27%) 10,354 (29%)
College graduate 26,020 (37%) 15,620 (44%) 10,400 (29%)
Income
< 10,000 3,611 (5.1%) 1,228 (3.5%) 2,383 (6.7%)
10,000 - 15,000 4,498 (6.4%) 1,412 (4.0%) 3,086 (8.7%)
15,000 - 20,000 5,557 (7.9%) 1,989 (5.6%) 3,568 (10%)
20,000 - 25,000 6,658 (9.4%) 2,604 (7.4%) 4,054 (11%)
25,000 - 35,000 8,010 (11%) 3,506 (9.9%) 4,504 (13%)
35,000 - 50,000 10,287 (15%) 4,996 (14%) 5,291 (15%)
50,000 - 75,000 11,425 (16%) 6,160 (17%) 5,265 (15%)
> 75,000 20,646 (29%) 13,451 (38%) 7,195 (20%)
1 Mean, Median (SD); n (%)

From the summary table, it can be seen that there are notable differences between the groups of patients with and without diabetes. For example, in the diabetes group, the proportions of individuals with high blood pressure, high cholesterol, heart disease or heart attack, difficulty walking, or a smoking history are higher. On the other hand, the proportions of people eating fruit, eating vegetables, or being physically active are lower among people with diabetes than among people without. The proportion of people with heavy alcohol consumption is higher among people without diabetes (6.2%) than among people with diabetes (2.4%). The diabetes group also scores slightly worse on mental health and considerably worse on physical health on average, and has a higher BMI (32 vs. 28). We will look into these group differences in more detail by visualizing them in the following section. Note that although Age consists of 13 age-group categories, we decide to treat it as continuous given that it has interval properties. Keep in mind that the value of Age represents the age category, not the actual age of an individual.

2.2 Descriptive plots

2.2.1 Marginal distributions

We examine histogram and density plots of numeric variables (i.e., Age, BMI, Physical Health, Mental Health), and bar plots for some of the categorical variables. We choose a couple of categorical variables that are deemed more interesting to explore (e.g., Education, Income).

## for each of the numeric variable, create a histogram and density plot
hist <- diabetes %>% 
   # select numeric variables
   select_if(is.numeric) %>% 
   # extract variable names
   names(.) %>% 
   # using map function to create ggplot for each variable
   map(
      ~ ggplot(diabetes, aes_string(x = .)) + 
      geom_histogram(
         aes(y=..density..), 
         color = 1, 
         fill = "white") +
      geom_density(
         color = "#158cba", 
         fill = "#158cba", 
         alpha = 0.25, 
         bw = 0.5) +
   theme_minimal())

# arrange all plots in 2 by 2 grid
plot1 <- ggarrange(plotlist=hist, nrow = 2, ncol = 2)

# add the title of figure
annotate_figure(
   plot1, 
   top = text_grob(
      "Distribution of BMI, Mental Health, Physical Health, and Age", 
      size = 14))

Figure 2.1: Marginal distribution of numeric variables

We see that Body Mass Index (BMI) is roughly normally distributed with a slight right skew. The number of days with poor mental health in the 30 days prior to the survey (MentHlth) is zero for more than 60% of the individuals. Each of the other responses is endorsed by less than 5% of participants, with a slight spike at thirty days (around 5%). The number of days with physical illness or injury in the 30 days prior to the survey (PhysHlth) follows a similar pattern, with the majority (>50%) having zero days with complaints. The remaining responses are usually below 10%, with the exception of thirty days, which is endorsed by slightly more than 10% of individuals. The 13-level age category (Age) has a unimodal, left-skewed distribution centered around ages 65-69 (the tenth response category).

## create a barplot for the categorical variables 
bar <- diabetes %>% 
   select(Education, Income) %>% 
   # extract variable names
   names(.) %>% 
   # using map function to create ggplot for each variable
   map(
      ~ ggplot(diabetes) + 
      geom_bar(
         aes_string(x = .), 
         color = "#158cba", 
         fill = "#158cba", 
         alpha = 0.25) +
      theme_classic()) 

# arrange all plots in 1 by 2 grid
plot2 <- ggarrange(plotlist = bar, ncol = 2)

# add the common title of figure
annotate_figure(
   plot2, 
   top = text_grob(
      "Distribution of Education and Income Categories", 
      size = 14))

Figure 2.2: Marginal distribution of categorical variables

The examination of the 6-level education level category (Education) reveals that the majority of the individuals are college graduates (the sixth category), followed by some college or technical school (the fifth category), and high school graduates (the fourth category). We see that lower levels of education are rarer in the dataset.

The 8-level income category (Income) shows that each income level contains more individuals than the preceding one. Hence, the lowest income category (<$10,000) has the fewest individuals, and counts rise steadily across the subsequent categories. However, the highest income category ($75,000+) comprises more individuals than the rate of increase between the preceding categories would suggest.

2.2.2 Conditional distributions

In the following part, we explore heterogeneity between the diabetes groups in some of the variables, including those that show large differences in the summary table (Table 2.1) above (e.g., proportion of high blood pressure, level of cholesterol, physical health).

## explore heterogeneity between groups using highest discriminating variables 
# create density plots for bmi and age
bmi_age <- diabetes %>% 
   select(BMI, Age) %>% 
   # extract variable names
   names(.) %>% 
   # using map function to create ggplot for each variable
   map(
      ~ ggplot(diabetes, aes_string(x = .)) +
      # density plot
      geom_density(aes(
         fill = Diabetes_binary, 
         color = Diabetes_binary), 
         alpha = 0.4, 
         bw = 0.5) +
      # add rug marks
      geom_rug(size=0.1, aes(colour = Diabetes_binary))+
      # specify our chosen color
      scale_fill_manual(
         values = c("#E69F00", "#56B4E9"), 
         labels=c("0: no", "1: yes")) +
      scale_color_manual(
         values = c("#E69F00", "#56B4E9"), 
         labels=c("0: no", "1: yes")) +
      # add the labels for legend
      labs(fill = "Diabetes", color = "Diabetes") + 
      # specify the minimal theme
      theme_minimal())

# create bar plots for blood pressure and cholesterol in proportion
bp_chol <- diabetes %>%
   select(HighBP, HighChol) %>%
   imap(~ as.data.frame(prop.table(
      table(.x, diabetes$Diabetes_binary), margin = 1)) %>%
      ggplot(aes(x = `.x`, y = Freq, col = Var2)) +
         geom_col(aes(
         fill = Var2,
         color = Var2),
         alpha = 0.4,
         width=0.4) +
      # specify our chosen color
      scale_fill_manual(
         values = c("#E69F00", "#56B4E9"),
         labels=c("0: no", "1: yes")) +
      scale_color_manual(
         values = c("#E69F00", "#56B4E9"),
         labels=c("0: no", "1: yes")) +
      # add the labels for legend
      labs(
         fill = "Diabetes", 
         color = "Diabetes", 
         x = .y,
         y = "Relative frequency") +
      # specify the classic2 theme
      theme_minimal())

# arrange plots in 2 by 2 grid
plot3 <- ggarrange(
   plotlist = c(bmi_age, bp_chol),
   ncol = 2, nrow=2, 
   common.legend = TRUE, 
   legend = "bottom")
annotate_figure(
   plot3,
   top = text_grob("Distribution of BMI, Age, High Blood Pressure and High Cholesterol per Diabetes Group", 
   size = 13))

Figure 2.3: Conditional distribution plots 1

Here we examine how the distributions of BMI and Age differ between diabetics and non-diabetics. For BMI, we see that diabetics have a higher average BMI and more observations in the right tail, which is also longer. This is in accordance with our expectations. For Age, we see that diabetics tend to be older than non-diabetics, which makes sense since type-2 diabetes typically develops later in life, e.g., due to lifestyle factors. The bottom two plots show that there is a higher proportion of people with high blood pressure and high cholesterol in the diabetic group.

## conditional plots for bmi per age and mental health per sex
# boxplot of bmi per age category
bmi <- diabetes %>% 
   ggplot(aes(x= as.factor(Age), y= BMI)) +
   geom_boxplot(aes(fill = Diabetes_binary), outlier.size = 0.3) +
   labs(
      title = "Distribution of BMI per Age category", 
      x = "Age (category)", 
      fill = "Diabetes") +
   scale_fill_manual(
      values= alpha(c("#E69F00", "#56B4E9"), 0.5), 
      labels=c("0: no", "1: yes"))+
   theme_minimal() + 
   theme(legend.position="none")

# boxplot of mental health per sex
mental <- diabetes %>% 
   ggplot(aes(x= Sex, y= MentHlth)) +
   geom_boxplot(aes(fill = Diabetes_binary), outlier.size = 0.1) +
   labs(title = "Distribution of Mentally Sick Days", fill = "Diabetes") +
   scale_fill_manual(
      values= alpha(c("#E69F00", "#56B4E9"), 0.5), 
      labels=c("0: no", "1: yes"))+
   scale_x_discrete(labels=c("0: Female", "1: Male")) +
   theme_minimal()

# boxplot of physical health per sex
physical <- diabetes %>% 
   ggplot(aes(x= Sex, y= PhysHlth)) +
   geom_boxplot(aes(fill = Diabetes_binary), outlier.size = 0.1) +
   labs(title = "Distribution of Physically Sick Days", fill = "Diabetes") +
   scale_fill_manual(
      values= alpha(c("#E69F00", "#56B4E9"), 0.5), 
      labels=c("0: no", "1: yes"))+
   scale_x_discrete(labels=c("0: Female", "1: Male")) +
   theme_minimal()

# arrange the plots 
ggarrange(bmi, 
   ggarrange(mental, physical, 
      nrow=1, ncol=2, common.legend = TRUE, legend = "bottom"), 
   nrow =2)

Figure 2.4: Conditional distribution plots 2

In the first plot, we see that diabetics have a higher BMI in every age category. The second plot shows that there is a greater interquartile range in mentally sick days for diabetics than for healthy people, especially for women as compared to men. Similarly, the third plot shows that diabetics have a greater interquartile range in physically sick days than healthy people which is also more prominent in women.

2.2.3 Correlation plot

Furthermore, we explore the correlation of having Diabetes with all of the other variables by creating the following barplot.

## plot indicating the correlation between outcome and all predictors
diabetes %>% 
   # convert all variables to numeric
   mutate(across(everything(), ~as.numeric(.))) %>% 
   # compute correlations
   cor(.) %>% as.data.frame() %>% 
   # select the correlations with Diabetes only
   select(Diabetes_binary) %>% 
   # rename the column
   rename(cor = Diabetes_binary) %>% 
   # drop the perfect correlation with itself
   filter(cor != 1) %>% 
   # create variables need to be mapped for the plot
   mutate(var = rownames(.),
          group = ifelse(cor > 0, "pos", "neg")) %>% 
   # create the ggplot
   ggplot(aes(reorder(x = var, -cor), y = cor, fill=group)) +
   geom_col() +
   # specify the color
   scale_fill_manual(
      values = alpha(c("#CC0000", "#000099"), 0.6), 
      labels = c("negative", "positive")) +
   # specify the labels
   labs(
      title = "Correlation of diabetes with other variables", 
      y = "Correlation with Diabetes", 
      fill = "Correlation", 
      x = "") +
   # specify the theme
   theme_minimal() + 
   # coordinates flip
   coord_flip()

Figure 2.5: Correlation between diabetes and other variables

From the plot, we can see that the correlations of Diabetes with the other variables range from around -0.2 (Income) to 0.4 (GenHlth). Overall, the correlations are modest; regardless, we expect the variables with relatively higher correlations to be more important when predicting whether a person has diabetes.


3 Analytic Strategy

3.1 Model Selection

The goal of the present assignment is to compare two machine learning models of differing complexity on a diabetes prediction task. We base our model selection on the set of models that were part of the INFOMDA1 course curriculum. Since our task is binary classification, we consider K-Nearest Neighbors (KNN), logistic regression, linear discriminant analysis (LDA); tree-based methods such as bagging, random forest (RF), and boosting; and the support vector machine (SVM). There is no single way to express model complexity. We apply the following strategy:

  1. First we classify the models into parametric and non-parametric models. This is useful because one definition of model complexity is tied to its parametricity. According to Russell, Norvig, and Davis (2010), while parametric models tend to be less complex due to having a fixed number of parameters to estimate, non-parametric models are associated with higher model complexity.
Parametric Methods: Logistic Regression, Linear Discriminant Analysis
Non-parametric Methods: K-Nearest Neighbors, Bagging, Random Forest, Boosting, Support Vector Machine
  2. Another definition of model complexity is tied to interpretability – while simple models tend to be interpretable, complex models do not (Sarker 2021). According to this definition, our parametric models are inherently interpretable and hence can be considered simple. Our non-parametric models are less interpretable and hence more complex, with the exception of KNN, whose internal mechanism is easy to understand; even though it is non-parametric, it is interpretable. Hence, we classify KNN as a simple model.

Having combined the two definitions of model complexity, we finalize our list for model selection:

Simple Methods: Logistic Regression, Linear Discriminant Analysis, K-Nearest Neighbors
Complex Methods: Bagging, Random Forest, Boosting, Support Vector Machine

For the simple model, we have to decide between logistic regression, LDA and KNN. According to Gareth et al. (2013), logistic regression will outperform LDA when the normality assumption does not hold. Many predictors in our dataset are categorical, so we cannot assume normal distributions for such predictors within both levels of the outcome. For this reason we prefer logistic regression. Furthermore, our dataset is relatively high-dimensional. In such circumstances, KNN often performs worse than parametric models, even when the data do not meet the parametric assumptions, e.g., are non-linear, because the total model error of KNN deteriorates significantly as the number of predictors grows (Gareth et al. 2013). For this reason, we exclude KNN, thus arriving at logistic regression as our simple model.

For the complex model, we have to choose between bagging, random forest, boosting and SVM. Bagging produces an ensemble of highly correlated trees, which makes it a high-variance method, especially compared to other tree ensemble methods (Gareth et al. 2013). Since we want our final model to have low variance (and low bias), we exclude bagging. Random forest produces an ensemble of decorrelated trees and hence has lower variance than bagging (Gareth et al. 2013). Boosting combines a large number of weak trees into a single ensemble prediction; because the individual trees are small, boosting tends to be more interpretable than random forest (Gareth et al. 2013). SVM is a classification algorithm based on dividing the feature space using kernels. Kernels allow for considerable flexibility in the shape of the boundary between the classes (Gareth et al. 2013). However, finding the most appropriate kernel can be a challenging task, and SVM models are highly uninterpretable (Martin-Barragan, Lillo, and Romo 2014). For these reasons, we choose boosting as our complex model.

To reiterate, our final choice is logistic regression for the simple model, and boosting for the complex model.

3.1.1 Simple Model: Logistic Regression

Logistic regression is a linear binary classifier (Gareth et al. 2013). It is considered a simple parametric model. When the true decision boundary is non-linear, logistic regression can have elevated bias compared to more complex, non-parametric models, because it makes a number of assumptions that need to be inspected (Kirasich, Smith, and Sadler 2018). In order to reduce its bias, logistic regression needs to be tuned properly. Nonetheless, we expect its accuracy to be lower than that of a complex model when dealing with data consisting of a mixture of categorical and continuous variables (Kirasich, Smith, and Sadler 2018). A limitation of logistic regression is that it can overfit in high-dimensional datasets, so it is preferable to apply regularization. To this end, we implement our own feature selection procedure instead of using an off-the-shelf regularization method. We compute feature importances by standardizing the regression coefficients of every predictor.

3.1.2 Complex Model: Boosting

Boosting is an ensemble tree method which can improve on the high variance of a single decision tree. Unlike bagging and random forest, boosting does not take bootstrapped samples from the data. Instead, a series of weak trees (often stumps) is grown sequentially. Each fitted tree will have some misclassifications. These misclassified observations are then given higher weights, and a new weak tree is fitted that focuses on correctly labelling the previously misclassified observations. Each iteration, however, often leads to another set of misclassified observations. This process is repeated a number of times, and in the final step, the large number of weak trees is combined to produce a single prediction for any new observation (Gareth et al. 2013). In order to find the optimal set of tuning parameters, we manually build a grid of hyperparameters and use 3-fold cross-validation to find the best combination. As boosting is a non-parametric method, we compute feature importance based on SHAP values (Lundberg and Lee 2017) to interpret the contribution of the features. More explanation on SHAP values will follow.
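To make the tuning step concrete, a hyperparameter grid can be built with expand.grid(). The grid values below are placeholders for illustration, not the grid actually used in this report, and the commented calls assume the caret and xgboost packages with a caret-style xgbTree tuning interface.

```r
## illustrative hyperparameter grid for boosting (values are examples only)
xgb_grid <- expand.grid(
   nrounds          = c(100, 200),   # number of boosting iterations
   max_depth        = c(2, 4, 6),    # tree depth (stumps would be depth 1)
   eta              = c(0.05, 0.1),  # learning rate
   gamma            = 0,
   colsample_bytree = 1,
   min_child_weight = 1,
   subsample        = 1
)
nrow(xgb_grid)  # 2 * 3 * 2 = 12 candidate combinations

## 3-fold cross-validation over the grid (sketch; requires caret + xgboost)
# ctrl <- caret::trainControl(method = "cv", number = 3)
# boost_fit <- caret::train(Diabetes_binary ~ ., data = diabetes_train,
#                           method = "xgbTree", trControl = ctrl,
#                           tuneGrid = xgb_grid)
```

Every row of the grid is evaluated with 3-fold cross-validation, and the combination with the best resampled performance is retained.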

3.2 Train-Test Split

We split the dataset into a training set (80%) and a test set (20%). We tune our models within the training set and use the test set to assess their performance. Since our dataset is perfectly balanced, there is no concern with regard to imbalanced classes when randomly dividing observations into the training and test sets.

## split the data into training and test set
# add train/test indicator 
diabetes %<>% 
   mutate(
      split = sample(c("train", "test"), 
      size = nrow(.), 
      replace = TRUE, 
      prob=c(0.8, 0.2)))

# training data
diabetes_train <- diabetes %>% 
   filter(split=="train") %>% 
   dplyr::select(-split)

# test data
diabetes_test <- diabetes %>% 
   filter(split=="test") %>% 
   dplyr::select(-split)
## visualize training/test dataset split
diabetes %>% 
   ggplot(aes(x = Diabetes_binary, fill = split)) +
   # create a stacked bar plot
   geom_bar(position = "stack") + 
   # choose a color scheme
   scale_fill_manual(values = alpha(c("#E69F00", "#56B4E9"), 0.5))+
   # specify the x-axis tick label
   scale_x_discrete(labels = c("No (0)", "Yes (1)")) +
   # add number of data points in the plot
   geom_text(
      stat='count', 
      aes(label = ..count..), 
      position = position_stack(vjust=0.5)) +
   # specify the labels
   labs(x = "Diabetes", title = "Train/Test Dataset Split") +
   # choose a simple theme
   theme_classic()

Figure 3.1: Training and Test dataset

3.3 Performance Evaluation

For each model, we create a confusion matrix from which we compute a number of classification metrics, such as the following:

  • Accuracy
  • Kappa (Cohen’s Kappa)
  • McNemar’s test
  • Sensitivity
  • Specificity
  • Positive predictive value
  • Negative predictive value
  • Balanced accuracy
  • F1 score

To compare model performance, we compare cross-validated accuracy and kappa values. For each model, we also plot the receiver operating characteristic (ROC) curve to assess the overall prediction performance, and a calibration plot to inspect how well the predicted probabilities correspond to observed proportions. In addition, we compare the two classifiers using decision boundary plots, which demonstrate how the classifiers divide the feature space based on the outcome. To identify risk factors of diabetes, we compute variable importance metrics for each model: for logistic regression we use the standardized regression coefficients, and for boosting we use SHAP values.
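All of the listed metrics can be derived from the four cells of the confusion matrix. Below is a minimal base-R sketch with a toy confusion matrix; the counts are made up for illustration, while in the analysis itself these numbers come from the models' predictions on the test set.

```r
## toy confusion matrix: rows = predicted class, columns = observed class
cm <- matrix(c(50, 10,
                8, 32),
             nrow = 2, byrow = TRUE,
             dimnames = list(pred = c("0", "1"), obs = c("0", "1")))

TN <- cm["0", "0"]; FN <- cm["0", "1"]   # true / false negatives
FP <- cm["1", "0"]; TP <- cm["1", "1"]   # false / true positives

accuracy    <- (TP + TN) / sum(cm)                  # proportion correct
sensitivity <- TP / (TP + FN)                       # true positive rate
specificity <- TN / (TN + FP)                       # true negative rate
ppv         <- TP / (TP + FP)                       # positive predictive value
npv         <- TN / (TN + FN)                       # negative predictive value
balanced    <- (sensitivity + specificity) / 2      # balanced accuracy
f1          <- 2 * ppv * sensitivity / (ppv + sensitivity)
```

With these toy counts, accuracy is 0.82 and the positive predictive value is 0.80; the caret package computes the same quantities (plus Cohen's kappa and McNemar's test) from a confusionMatrix object.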


4 Analysis

4.1 Simple Model (Binary Logistic Regression)

The model development for the binary logistic regression consists of three steps. First, we perform feature selection using nonparametric bootstrapping. Second, we apply 10-fold cross-validation on the training data. Finally, we evaluate the model performance on the test data, using 10-fold cross-validation repeated 3 times.
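The fold assignment underlying k-fold cross-validation can be sketched in base R as below; with caret, the equivalent repeated scheme would typically be requested via trainControl(method = "repeatedcv", number = 10, repeats = 3). The sample size here is a toy value, shown only to make the resampling scheme concrete.

```r
set.seed(123)
n <- 20    # toy sample size
k <- 10    # number of folds

## randomly assign each observation to one of k equal-sized folds
folds <- sample(rep(seq_len(k), length.out = n))
table(folds)  # each fold holds n / k = 2 observations

## one round of cross-validation: hold out each fold in turn
# for (i in seq_len(k)) {
#    held_out <- which(folds == i)
#    # fit on observations outside held_out, evaluate on held_out ...
# }
## repeating the whole procedure 3 times gives "10-fold CV repeated 3 times"
```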

4.1.1 Feature Selection

Although the glm function automatically dummy-encodes factor variables, we explicitly create the dummy encoding here, which simplifies the feature selection pipeline.

## dummy encode the diabetes train and test set
# create dummy encoding for the training data for the glm pipeline
diabetes_2_train <- diabetes_train %>%
   mutate(Diabetes_binary = as.numeric(Diabetes_binary) - 1) %>%
   fastDummies::dummy_cols(
      remove_first_dummy = TRUE,
      remove_selected_columns = TRUE
   )

# create dummy encoding for the test data for the glm pipeline
diabetes_2_test <- diabetes_test %>%
   mutate(Diabetes_binary = as.numeric(Diabetes_binary) - 1) %>%
   fastDummies::dummy_cols(
      remove_first_dummy = TRUE,
      remove_selected_columns = TRUE
   )

From the training data, we create 250 bootstrap samples. To each sample, we fit a full model (including all predictors) using the glm() function of the stats package. From each model, we obtain the standardized coefficients. The bootstrap estimate \(c_k\) for each predictor \(k\) is simply the mean of its standardized coefficient across the bootstrap fits.

## find the most important predictors by using bootstrap estimation
# create 250 bootstrap samples from the training data
bootstraps <- replicate(
   n = 250,
   expr = diabetes_2_train[sample(nrow(diabetes_2_train), replace = TRUE), ],
   simplify = FALSE
)

# create the formula for the binary logistic regression models
predictors <- paste(colnames(diabetes_2_train[-1]), collapse = " + ")
formula <- as.formula(paste0("Diabetes_binary ~ ", predictors))

# fit a binary logistic regression model to each bootstrap dataset
bootstraps_fit <- bootstraps %>%
  map(~.x %>%
        glm(formula = formula,
            data = .,
            family = binomial(link = "logit"))
      )

# obtain the standardized coefficients from each model
bootstraps_coef <- bootstraps_fit %>%
  map(~.x %>% lm.beta() %>% .[["coefficients"]]) %>%
  do.call("rbind", .) %>%
  as_tibble() 

We then order the coefficients by absolute value in descending order \((c_1, ..., c_k)\); this ordered list serves as a measure of variable importance. Next, we fit a sequence of models of increasing complexity: starting from the null model (intercept only), we successively include the \(k\) most important predictors, stopping once the full model is reached.

## fit logistic models with the k most important predictors
# get the importance of variables based on the standardized coefficients
predictors_ordered <- bootstraps_coef %>%
   colMeans() %>%
   .[-1] %>%
   .[order(abs(.), decreasing = TRUE)] %>%
   names()

# create a list to store the models
models <- list()

# for each variable
for (k in 1:length(predictors_ordered)) {
   # select the k most important predictors
   predictors <- paste(predictors_ordered[1:k], collapse = " + ")

   # create formula for the logistic regression
   formula <- as.formula(paste0("Diabetes_binary ~", predictors))

   # fit logistic regression model
   models[[k]] <- glm(
      formula = formula,
      data = diabetes_2_train,
      family = binomial(link = "logit")
   )
}

# additionally create the null model (intercept only)
model_null <- glm(
      formula = Diabetes_binary ~ 1,
      data = diabetes_2_train,
      family = binomial(link = "logit")
) %>% list()

# combine the null model with models
models <- c(model_null, models)

For all models, we evaluate the accuracy, kappa, sensitivity, and specificity values.

## evaluate each model by obtaining the confusion matrix
# for each model obtain the confusion matrix
bootstrap_eval <- models %>%
   map(~.x %>%
      predict(., type = "response") %>%
      tibble() %>%
      mutate(
         predict = (. > 0.5), .keep = "none",
         truth = as.logical(diabetes_2_train$Diabetes_binary),
         predict = factor(predict, levels = c("TRUE", "FALSE")),
         truth = factor(truth, levels = c("TRUE", "FALSE"))
      ) %>%
      table() %>%
      caret::confusionMatrix(positive = "TRUE")
   )

# for each model get the overall results of the confusion matrix results
bootstrap_accuracy <- bootstrap_eval %>%
   sapply("[[", "overall") %>%
   t() %>%
   as.data.frame() %>%
   mutate(k = row_number() - 1)

# for each model get all other results of the confusion matrix results
bootstrap_confusion <- bootstrap_eval %>%
   sapply("[[", "byClass") %>%
   t() %>%
   as.data.frame() %>%
   mutate(k = row_number() - 1)

## plot of the evaluation metrics to select a certain k
bootstrap_accuracy %>%
   left_join(bootstrap_confusion, by = "k") %>%
   pivot_longer(
       cols = c("Accuracy", "Kappa", "Sensitivity", "Specificity"),
       names_to = "Measure",
       values_to = "Value") %>%
   ggplot(aes(x = k, y = Value, color = Measure)) +
   geom_vline(xintercept = 21, linetype = "dotted") +
   geom_point() +
   geom_line() +
   theme_minimal()

Figure 4.1: Model Complexity versus Model Performance

The figure above shows the model performance (accuracy, kappa, sensitivity, and specificity) as a function of the model complexity \(k\). We observe that after including the 21 most important predictors, the evaluation metrics no longer increase. Hence, we select the model with \(k = 21\) predictors. In the next section, we apply 10-fold cross validation repeated three times.

4.1.2 Cross Validation

## perform cross validation on the selected model 
# re-code the outcome variable as factor again for the caret train
diabetes_2_train <- diabetes_2_train %>%
   mutate(
      Diabetes_binary = as.factor(Diabetes_binary),
      Diabetes_binary = recode(Diabetes_binary,
         "0" = "Nondiabetic", "1" = "Diabetic"))

# specify the internal validation settings
train_control <- trainControl(
   method = "repeatedcv", 
   number = 10,
   repeats = 3,
   allowParallel = TRUE
)

# train the model on training set
model <- train(
   models[[22]][["formula"]],
   data = diabetes_2_train,
   trControl = train_control,
   method = "glm",
   family = binomial()
)
# summary of the final fitted model
summary(model)

Call:
NULL

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.4533  -0.8076  -0.0912   0.8304   3.0313  

Coefficients:
                        Estimate Std. Error z value Pr(>|z|)    
(Intercept)            -6.714125   0.118003 -56.898  < 2e-16 ***
GenHlth_5               1.999529   0.059173  33.791  < 2e-16 ***
GenHlth_4               1.879895   0.047479  39.594  < 2e-16 ***
GenHlth_3               1.433579   0.042504  33.728  < 2e-16 ***
CholCheck_1             1.303189   0.088306  14.758  < 2e-16 ***
GenHlth_2               0.731591   0.043044  16.996  < 2e-16 ***
HvyAlcoholConsump_1    -0.727678   0.053769 -13.533  < 2e-16 ***
HighBP_1                0.712602   0.022064  32.298  < 2e-16 ***
HighChol_1              0.582029   0.021028  27.678  < 2e-16 ***
Income_8               -0.370885   0.034959 -10.609  < 2e-16 ***
Sex_1                   0.290192   0.021040  13.792  < 2e-16 ***
HeartDiseaseorAttack_1  0.256965   0.031474   8.164 3.23e-16 ***
Income_7               -0.225526   0.036676  -6.149 7.79e-10 ***
Income_6               -0.197828   0.036655  -5.397 6.77e-08 ***
Income_5               -0.166271   0.038636  -4.304 1.68e-05 ***
Stroke_1                0.188402   0.045319   4.157 3.22e-05 ***
Age                     0.152610   0.004233  36.048  < 2e-16 ***
DiffWalk_1              0.109250   0.027709   3.943 8.05e-05 ***
Income_4               -0.079990   0.040689  -1.966   0.0493 *  
Education_2             0.109552   0.071563   1.531   0.1258    
Education_6            -0.095708   0.023261  -4.115 3.88e-05 ***
BMI                     0.074232   0.001745  42.528  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 78547  on 56659  degrees of freedom
Residual deviance: 58027  on 56638  degrees of freedom
AIC: 58071

Number of Fisher Scoring iterations: 5

From the model summary above, we can observe the following:

  • The positive coefficients of GenHlth_5, GenHlth_4, GenHlth_3, CholCheck_1, GenHlth_2, HighBP_1, HighChol_1, Sex_1, HeartDiseaseorAttack_1, Stroke_1, Age, DiffWalk_1, Education_2, and BMI of the logistic regression model translate into odds ratios that are greater than one. This, in turn, means that the predicted probability of having diabetes increases as the covariate increases (for dummy variables, when the indicator equals 1 rather than 0).

  • Conversely, the negative coefficients of HvyAlcoholConsump_1, Income_8, Income_7, Income_6, Income_5, Income_4, and Education_6 of the logistic regression model translate into odds ratios that are less than one. This, in turn, means that the predicted probability of having diabetes decreases as the covariate increases.

  • All predictors are statistically significant at the alpha level of 0.05, except Education_2.

  • HvyAlcoholConsump_1 appears to have a protective effect against having diabetes. This result could be spurious and is probably caused by either confounding or adjustment on a collider.
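To make the odds-ratio interpretation concrete, the snippet below exponentiates two of the coefficients reported in the summary above; in practice one would apply exp() to the full coefficient vector of the fitted model.

```r
# Convert log-odds coefficients from the model summary into odds ratios.
coefs <- c(HighBP_1 = 0.712602, HvyAlcoholConsump_1 = -0.727678)
odds_ratios <- exp(coefs)
round(odds_ratios, 2)
# HighBP_1 = 2.04: high blood pressure is associated with roughly twice
# the odds of diabetes; HvyAlcoholConsump_1 = 0.48: less than half the
# odds, holding the other predictors constant.
```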

# model
print(model)
Generalized Linear Model 

56660 samples
   21 predictor
    2 classes: 'Nondiabetic', 'Diabetic' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 50994, 50993, 50994, 50995, 50994, 50993, ... 
Resampling results:

  Accuracy   Kappa    
  0.7466466  0.4932949

4.1.3 Prediction Evaluation

## confusion matrix of the test data
# re-code the variables for the confusion matrix
diabetes_2_test <- diabetes_2_test %>%
   mutate(
      Diabetes_binary = as.factor(Diabetes_binary),
      Diabetes_binary = recode(Diabetes_binary,
         "0" = "Nondiabetic", "1" = "Diabetic"))

# predicted and true values for the outcome variable
pred = predict(model, newdata = diabetes_2_test)
true = diabetes_2_test$Diabetes_binary

# confusion matrix
confusionMatrix(pred, true, positive = "Diabetic")
Confusion Matrix and Statistics

             Reference
Prediction    Nondiabetic Diabetic
  Nondiabetic        5176     1603
  Diabetic           1838     5415
                                          
               Accuracy : 0.7548          
                 95% CI : (0.7476, 0.7619)
    No Information Rate : 0.5001          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.5095          
                                          
 Mcnemar's Test P-Value : 6.633e-05       
                                          
            Sensitivity : 0.7716          
            Specificity : 0.7380          
         Pos Pred Value : 0.7466          
         Neg Pred Value : 0.7635          
             Prevalence : 0.5001          
         Detection Rate : 0.3859          
   Detection Prevalence : 0.5169          
      Balanced Accuracy : 0.7548          
                                          
       'Positive' Class : Diabetic        
                                          
  • The confusion matrix obtained from logistic regression model on the test data is given above. There are 5415 true positives (TP), 5176 true negatives (TN), 1838 false positives (FP), and 1603 false negatives (FN). Note, the positive category is Diabetic.

  • Accuracy = (TP + TN) / (TP + TN + FP + FN) = 0.75, with the 95% CI [0.748, 0.762]. The “no-information rate” is roughly 0.50, given the balanced dataset. We reject the null hypothesis that the accuracy is no greater than the no-information rate (\(p < .001\)).

  • Kappa (Cohen’s Kappa) estimates the level of agreement; Kappa = \((p_{o} - p_{e})/(1 - p_{e})\), where \(p_{o}\) is the observed proportion of agreement and \(p_{e}\) is the expected proportion of agreement (baseline agreement). Kappa = 1 means perfect agreement, whereas Kappa = 0 means the observed agreement is due only to chance. There are no strict guidelines for interpreting the magnitude of Kappa, but in general, values between 0.41 and 0.60 are considered moderate agreement (Landis and Koch 1977). On that basis, we conclude that there is moderate agreement in our case, given that the Kappa value is 0.51.

  • McNemar’s test also checks the agreement between the observed and predicted proportions of the “Yes” class. The null hypothesis is that there is no disagreement. Given that the corresponding p-value is very small (\(p <.001\)), we reject the null hypothesis and conclude that there is a disagreement between the observed and predicted proportions. This is not surprising, as a sizable portion of cases are mis-predicted.

  • Sensitivity = TP / (TP + FN) = 0.77, meaning that if one has diabetes, there is a 77% probability that the model will detect this.

  • Specificity = TN / (TN + FP) = 0.74, meaning that if one does not have diabetes, there is a 74% probability that the model will detect this.

  • Positive predictive value (PPV) = TP / (TP + FP) = 0.75, meaning that if one is predicted to have diabetes, there is a 75% probability that the person actually has diabetes.

  • Negative predictive value (NPV) = TN / (TN + FN) = 0.76, meaning that if one is predicted to not have diabetes, there is a 76% probability that the person actually does not have diabetes.

  • Balanced accuracy = (sensitivity + specificity)/2 = 0.75.

  • F1 score = 2 \(\cdot\) (precision \(\cdot\) recall) / (precision + recall) = 0.76. Both the balanced accuracy and F1 score are in agreement with the accuracy value. That is not surprising given the balanced dataset.
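All of the metrics above can be recomputed directly from the four cell counts of the confusion matrix; the base-R sketch below reproduces the reported values.

```r
# Recompute the reported metrics from the confusion-matrix cell counts.
TP <- 5415; TN <- 5176; FP <- 1838; FN <- 1603
n  <- TP + TN + FP + FN

accuracy    <- (TP + TN) / n
sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
ppv         <- TP / (TP + FP)
npv         <- TN / (TN + FN)
balanced    <- (sensitivity + specificity) / 2
f1          <- 2 * ppv * sensitivity / (ppv + sensitivity)

# expected agreement for Cohen's kappa, from the marginal totals
p_e   <- ((TP + FN) / n) * ((TP + FP) / n) + ((TN + FP) / n) * ((TN + FN) / n)
kappa <- (accuracy - p_e) / (1 - p_e)

round(c(accuracy = accuracy, kappa = kappa,
        sensitivity = sensitivity, specificity = specificity), 4)
# accuracy = 0.7548, kappa = 0.5095, sensitivity = 0.7716, specificity = 0.7380
```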

4.1.4 Variable Importance

The variable importance can be estimated from the z-value of each coefficient.

## check variable importance (top 15)
varImp(model, scale=TRUE)[["importance"]] %>% 
   rename("Importance" = "Overall") %>%
   # subset the top 15
   slice_max(Importance, n = 15) %>%
   # create ggplot for importance 
   ggplot(aes(
      x = reorder(rownames(.), Importance), 
      y = Importance, 
      fill = Importance)) + 
   geom_bar(stat = "identity") +
   # add the importance value
   geom_text(
      aes(label = round(Importance, 2), col = Importance), 
      hjust = -0.2) +
   # specify the gradient color
   scale_fill_gradient(low = "grey40", high = "#28B463") + 
   scale_color_gradient(low = "grey40", high = "#28B463") +
   scale_y_continuous(limits = c(0, 104)) +
   coord_flip() + theme_minimal() +
   # edit the labels
   theme(legend.position = "none", axis.title.y = element_blank()) +
   labs(title = "Logistic regression - Variable Importance")

The variable importance plot shows that BMI, Age, GenHlth_3, GenHlth_4, GenHlth_5, HighBP_1, and HighChol_1 are important predictors. It is interesting to note that these variables correspond to the highly correlated variables we saw earlier in the correlation plot in Section 2.2.3.

4.1.5 Model Assumptions

In this section, we check the model assumptions. The logistic regression model assumes:

  • A dichotomous outcome
  • Independence of observations
  • Linearity between continuous predictors and the logit of the outcome

In addition, we check for influential values and multicollinearity.

Dichotomous Outcome The outcome Diabetes_binary is indeed binary.

# linearity assumption
diabetes_2_test %>% 
   select(Age, BMI) %>%
   mutate(Prob = predict(
      model, 
      newdata = diabetes_2_test, 
      type = "prob")$Diabetic,
          Odds = Prob / (1 - Prob), 
          Logit = log(Odds)) %>%
   gather(key = "Predictor", value = "Value", -c(Logit, Prob, Odds)) %>%
   ggplot(aes(x = Value, y = Logit)) + 
   geom_point(size = 0.5, alpha = 0.5) +
   geom_smooth(method = "loess") + 
   theme_bw() + 
   facet_wrap(~ Predictor, scales = "free_x")

Figure 4.2: Relationship between Continuous Predictors and the Logit of the Outcome

Linearity assumption The scatter plots show that Age and BMI are roughly linearly associated with the outcome on the logit scale.

Independence of observations The standardized residuals appear to follow no particular pattern. This evidence supports that the observations are independent.

Influential values There are no absolute standardized residuals above 3. Hence, there is support that there are no influential observations in our data.

# independence + outliers
broom::augment(model[["finalModel"]]) %>% 
   mutate(
      Index = 1:n(),
      Index = abs(Index - median(Index))) %>%
   ggplot(aes(Index, .std.resid)) + 
   geom_point(aes(color = .outcome), alpha = 0.2) +
   labs(y = "Std Residuals") + 
   theme_bw()

Figure 4.3: Standardized Residuals

Multicollinearity There appears to be no multicollinearity, as no VIF value exceeds 5.

# multicollinearity
car::vif(model[["finalModel"]]) 
             GenHlth_5              GenHlth_4              GenHlth_3 
              2.138828               3.226958               4.128356 
           CholCheck_1              GenHlth_2    HvyAlcoholConsump_1 
              1.005731               3.711381               1.006313 
              HighBP_1             HighChol_1               Income_8 
              1.128100               1.055671               2.365394 
                 Sex_1 HeartDiseaseorAttack_1               Income_7 
              1.062983               1.124256               1.800788 
              Income_6               Income_5               Stroke_1 
              1.678179               1.502817               1.064292 
                   Age             DiffWalk_1               Income_4 
              1.200149               1.338805               1.394317 
           Education_2            Education_6                    BMI 
              1.041588               1.197418               1.110791 

4.2 Complex Model (Gradient Boosting)

We use the XGBoost algorithm, which stands for eXtreme Gradient Boosting, to build a boosting model. As its name implies, it uses the gradient boosting technique, adding weak learners until no further improvement can be made.

4.2.1 Hyperparameters of eXtreme Gradient Boosting:

  • nrounds: the number of boosting iterations, that is, how often the algorithm iteratively improves the fit based on the result of the previous round. The higher nrounds is, the more complex the resulting model becomes (the more closely it fits the training dataset).
  • eta: step-size shrinkage, analogous to a learning rate, used to prevent overfitting. The higher eta is, the faster the algorithm fits, at the cost of a higher risk of overfitting.
  • max_depth: the maximum depth of a tree, which can be used to control overfitting. A higher max_depth allows the model to become more complex. It will be tuned using CV.
  • gamma: minimum loss reduction required to make a split. The larger gamma is, the more conservative the algorithm will be.
  • colsample_bytree: the subsample ratio of columns when constructing each tree. Subsampling occurs once for every tree constructed.
  • subsample: the fraction of observations randomly sampled for each tree. Setting it to 0.5 means the algorithm randomly samples half of the training data prior to growing each tree. Lower values make the algorithm more conservative and help prevent overfitting.
  • min_child_weight: minimum sum of weights of all observations required in a child. It is also used to control overfitting such that higher values prevent a model from learning highly specific relations in the training dataset.
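These hyperparameters map directly onto the native xgboost R interface. The sketch below is illustrative only: the data are simulated and the parameter values are placeholders, not our tuned values.

```r
# Sketch of the native xgboost API with the hyperparameters listed above;
# the data here are simulated and the values are illustrative only.
library(xgboost)

set.seed(1)
x <- matrix(rnorm(200 * 5), ncol = 5)
y <- as.numeric(x[, 1] + rnorm(200) > 0)
dtrain <- xgb.DMatrix(data = x, label = y)

params <- list(
   objective        = "binary:logistic",
   eta              = 0.1,   # shrinkage / learning rate
   max_depth        = 4,     # maximum tree depth
   gamma            = 0,     # minimum loss reduction to split
   colsample_bytree = 0.5,   # column subsample ratio per tree
   subsample        = 0.5,   # row subsample ratio per tree
   min_child_weight = 20     # minimum sum of instance weight in a child
)

fit <- xgb.train(params = params, data = dtrain, nrounds = 50)
head(predict(fit, dtrain))  # predicted probabilities
```

caret's xgbTree method passes these same parameters through from the tuning grid, so the grid below searches over exactly these knobs.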

4.2.2 Grid Search for Hyperparameters

We build a tuning grid for the algorithm to explore different hyperparameter combinations. To get started, we use some suggestions from this tutorial.

## grid search: set up the hyper-parameter search pool
# there are in total 7 hyperparameters for the eXtreme Gradient Boosting algorithm
xgb_grid = expand.grid(
   # number of boosting iterations
   nrounds = c(500, 1000, 1500),
   # shrinkage parameter
   eta = c(0.1, 0.3, 0.5),
   # max tree depth
   max_depth = c(2, 4, 6),
   # minimum loss reduction
   gamma = c(0, 10),
   # subsample ratio of columns
   colsample_bytree = c(0.5, 1),
   # subsample percentage
   subsample = c(0.5, 1),
   # minimum sum of instance weight
   min_child_weight = c(0, 20)
)

# specify the training control parameters 
cvcontrol <- trainControl(
   # 3-fold cv
   method = "cv",
   number = 3,
   # save the performance measures for all resample
   returnResamp = "all",
   # save the class probabilities for each resample
   classProbs = TRUE)

# have to change the levels of Diabetes_binary (output variable) to character instead of number (0, 1)
# (caret package specific)
levels(diabetes_train$Diabetes_binary) = c("No","Yes")

# train the model for each parameter combination in the grid
# using CV specified as per training control parameters
xgboost_tune <- train(Diabetes_binary ~ ., 
                       data = diabetes_train,
                       # method: eXtreme gradient boosting
                       method = "xgbTree",
                       # loss function to be minimized: (binary:logistic) 
                       objective = "binary:logistic",
                       # apply our training control specifics
                       trControl = cvcontrol,
                       # search the specified grid above
                       tuneGrid = xgb_grid)

# save the result as Rdata since it takes very long to run
save(xgboost_tune, file = "xgb_tune.RData")

4.2.3 Tuning result

## load the xgb tuning result
load("Data/Xgb_tune.RData")
# tuning results in detail
print(xgboost_tune)
eXtreme Gradient Boosting 

56660 samples
   21 predictor
    2 classes: 'No', 'Yes' 

No pre-processing
Resampling: Cross-Validated (3 fold) 
Summary of sample sizes: 37773, 37774, 37773 
Resampling results across tuning parameters:

  eta  max_depth  gamma  colsample_bytree  min_child_weight  subsample  nrounds
  0.1  2           0     0.5                0                0.5         500   
  0.1  2           0     0.5                0                0.5        1000   
  0.1  2           0     0.5                0                0.5        1500   
  0.1  2           0     0.5                0                1.0         500   
  0.1  2           0     0.5                0                1.0        1000   
  0.1  2           0     0.5                0                1.0        1500   
  0.1  2           0     0.5               20                0.5         500   
  0.1  2           0     0.5               20                0.5        1000   
  0.1  2           0     0.5               20                0.5        1500   
  0.1  2           0     0.5               20                1.0         500   
  0.1  2           0     0.5               20                1.0        1000   
  0.1  2           0     0.5               20                1.0        1500   
  0.1  2           0     1.0                0                0.5         500   
  0.1  2           0     1.0                0                0.5        1000   
  0.1  2           0     1.0                0                0.5        1500   
  0.1  2           0     1.0                0                1.0         500   
  0.1  2           0     1.0                0                1.0        1000   
  0.1  2           0     1.0                0                1.0        1500   
  0.1  2           0     1.0               20                0.5         500   
  0.1  2           0     1.0               20                0.5        1000   
  0.1  2           0     1.0               20                0.5        1500   
  0.1  2           0     1.0               20                1.0         500   
  0.1  2           0     1.0               20                1.0        1000   
  0.1  2           0     1.0               20                1.0        1500   
  0.1  2          10     0.5                0                0.5         500   
  0.1  2          10     0.5                0                0.5        1000   
  0.1  2          10     0.5                0                0.5        1500   
  0.1  2          10     0.5                0                1.0         500   
  0.1  2          10     0.5                0                1.0        1000   
  0.1  2          10     0.5                0                1.0        1500   
  0.1  2          10     0.5               20                0.5         500   
  0.1  2          10     0.5               20                0.5        1000   
  0.1  2          10     0.5               20                0.5        1500   
  0.1  2          10     0.5               20                1.0         500   
  0.1  2          10     0.5               20                1.0        1000   
  0.1  2          10     0.5               20                1.0        1500   
  0.1  2          10     1.0                0                0.5         500   
  0.1  2          10     1.0                0                0.5        1000   
  0.1  2          10     1.0                0                0.5        1500   
  0.1  2          10     1.0                0                1.0         500   
  0.1  2          10     1.0                0                1.0        1000   
  0.1  2          10     1.0                0                1.0        1500   
  0.1  2          10     1.0               20                0.5         500   
  0.1  2          10     1.0               20                0.5        1000   
  0.1  2          10     1.0               20                0.5        1500   
  0.1  2          10     1.0               20                1.0         500   
  0.1  2          10     1.0               20                1.0        1000   
  0.1  2          10     1.0               20                1.0        1500   
  0.1  4           0     0.5                0                0.5         500   
  0.1  4           0     0.5                0                0.5        1000   
  0.1  4           0     0.5                0                0.5        1500   
  0.1  4           0     0.5                0                1.0         500   
  0.1  4           0     0.5                0                1.0        1000   
  0.1  4           0     0.5                0                1.0        1500   
  0.1  4           0     0.5               20                0.5         500   
  0.1  4           0     0.5               20                0.5        1000   
  0.1  4           0     0.5               20                0.5        1500   
  0.1  4           0     0.5               20                1.0         500   
  0.1  4           0     0.5               20                1.0        1000   
  0.1  4           0     0.5               20                1.0        1500   
  0.1  4           0     1.0                0                0.5         500   
  0.1  4           0     1.0                0                0.5        1000   
  0.1  4           0     1.0                0                0.5        1500   
  0.1  4           0     1.0                0                1.0         500   
  0.1  4           0     1.0                0                1.0        1000   
  0.1  4           0     1.0                0                1.0        1500   
  0.1  4           0     1.0               20                0.5         500   
  0.1  4           0     1.0               20                0.5        1000   
  0.1  4           0     1.0               20                0.5        1500   
  0.1  4           0     1.0               20                1.0         500   
  0.1  4           0     1.0               20                1.0        1000   
  0.1  4           0     1.0               20                1.0        1500   
  0.1  4          10     0.5                0                0.5         500   
  0.1  4          10     0.5                0                0.5        1000   
  0.1  4          10     0.5                0                0.5        1500   
  0.1  4          10     0.5                0                1.0         500   
  0.1  4          10     0.5                0                1.0        1000   
  0.1  4          10     0.5                0                1.0        1500   
  0.1  4          10     0.5               20                0.5         500   
  0.1  4          10     0.5               20                0.5        1000   
  0.1  4          10     0.5               20                0.5        1500   
  0.1  4          10     0.5               20                1.0         500   
  0.1  4          10     0.5               20                1.0        1000   
  0.1  4          10     0.5               20                1.0        1500   
  0.1  4          10     1.0                0                0.5         500   
  0.1  4          10     1.0                0                0.5        1000   
  0.1  4          10     1.0                0                0.5        1500   
  0.1  4          10     1.0                0                1.0         500   
  0.1  4          10     1.0                0                1.0        1000   
  0.1  4          10     1.0                0                1.0        1500   
  0.1  4          10     1.0               20                0.5         500   
  0.1  4          10     1.0               20                0.5        1000   
  0.1  4          10     1.0               20                0.5        1500   
  0.1  4          10     1.0               20                1.0         500   
  0.1  4          10     1.0               20                1.0        1000   
  0.1  4          10     1.0               20                1.0        1500   
  0.1  6           0     0.5                0                0.5         500   
  0.1  6           0     0.5                0                0.5        1000   
  0.1  6           0     0.5                0                0.5        1500   
  0.1  6           0     0.5                0                1.0         500   
  0.1  6           0     0.5                0                1.0        1000   
  0.1  6           0     0.5                0                1.0        1500   
  0.1  6           0     0.5               20                0.5         500   
  0.1  6           0     0.5               20                0.5        1000   
  0.1  6           0     0.5               20                0.5        1500   
  0.1  6           0     0.5               20                1.0         500   
  ... (remaining rows of the tuning grid, and the corresponding resampled Accuracy and Kappa columns, condensed) ...

The seven unlabeled columns of the printout above correspond to the tuning parameters of `caret`'s `xgbTree` method (eta, max_depth, gamma, colsample_bytree, min_child_weight, subsample, nrounds), searched over eta ∈ {0.1, 0.3, 0.5}, max_depth ∈ {2, 4, 6}, gamma ∈ {0, 10}, colsample_bytree ∈ {0.5, 1.0}, min_child_weight ∈ {0, 20}, subsample ∈ {0.5, 1.0}, and nrounds ∈ {500, 1000, 1500}, for 3 × 3 × 2 × 2 × 2 × 2 × 3 = 432 combinations in total. Across the combinations shown, the cross-validated accuracy ranges from roughly 0.688 to 0.752 and Cohen's kappa from roughly 0.375 to 0.503; the highest accuracies appear at the smallest learning rate (eta = 0.1), with the best value shown being approximately 0.7517.
  0.7489410  0.4978850

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were nrounds = 1000, max_depth = 6, eta
 = 0.1, gamma = 10, colsample_bytree = 0.5, min_child_weight = 0 and
 subsample = 1.


To get an idea of how much the accuracy differs across sets of tuning parameters, we plot the accuracy for varying values of eta, gamma, max_depth, and nrounds while keeping the other parameters fixed.

## plot the accuracy for different combinations of tuning parameters
xgboost_tune$results %>% 
  # find the max accuracy data point
  mutate(max_acc = factor(max(Accuracy) == Accuracy)) %>%
  # fix the hyperparameters that are not of interest
  filter(colsample_bytree == 0.5, min_child_weight == 0, subsample == 1.0) %>% 
  # convert the following variables to factor 
  mutate_at(vars(gamma, max_depth, nrounds), list(factor)) %>% 
  # create ggplot
  ggplot(aes(x = eta, y = Accuracy, color = max_depth))+
  # change the point shape and size for the maximum accuracy point
  geom_point(aes(shape = max_acc, size = max_acc))+
  geom_line()+
  # facet the plot on the combination of the following two variables
  facet_grid(gamma~nrounds, labeller = label_both, scales = "free_y")+
  labs(title="Accuracy for selected tuning parameters", subtitle = "Fixed tuning parameters: colsample_bytree = 0.5, min_child_weight = 0, subsample = 1.0", shape = "Locate the highest accuracy point")+
   # specify the max accuracy point shape
   scale_shape_manual(values = c(20, 8), labels = c("accuracy", "Maximum accuracy"))+
   # specify the max accuracy point size
   scale_size_manual(values = c(0.7, 2.5))+
   # get rid of one part of the legend and change the theme
   guides(size="none") + theme_light() + theme(legend.position = "bottom")

Figure 4.4: Accuracy per combination of tuning parameters

In this subset of options, the accuracy for eta = 0.1, nrounds = 1000, max_depth = 6, and gamma = 10 is the highest. Indeed, this is the best set of hyperparameters found by the algorithm across all the different combinations (see below).

The best set of hyperparameters

nrounds  max_depth  eta  gamma  colsample_bytree  min_child_weight  subsample
1000     6          0.1  10     0.5               0                 1
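As the output above states, caret's selection rule is simply an arg-max of Accuracy over the tuning grid. A toy sketch of that step (a hypothetical results data frame, mirroring what `xgboost_tune$bestTune` stores):

```r
# toy tuning results: one row per hyperparameter combination
results <- data.frame(
  eta       = c(0.05, 0.1, 0.3),
  max_depth = c(6, 6, 4),
  Accuracy  = c(0.748, 0.751, 0.744)
)

# pick the row with the highest accuracy (what caret reports as bestTune)
best <- results[which.max(results$Accuracy), ]
```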

4.2.4 XGB final model

Next, we run the XGBoost algorithm with the best set of hyperparameters identified above. We use three times repeated 10-fold cross-validation and save the identified final model.

# best tuning parameter set
final_grid <- xgboost_tune$bestTune

# specify the internal validation settings
train_control <- trainControl(method = "repeatedcv", 
                          number = 10,
                          repeats = 3,
                          allowParallel = TRUE)

# obtain the final model based on the best parameter set
xgb_model_final <- train(Diabetes_binary ~ ., 
                       data = diabetes_train,
                       trControl = train_control,
                       tuneGrid = final_grid,
                       method = "xgbTree",
                       verbose = TRUE)

# save the result as RData since it takes a while to run
save(xgb_model_final, file = "Data/xgb_final.RData")
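Under the hood, `repeatedcv` partitions the training data into 10 folds and repeats that partitioning three times, so every observation is held out exactly once per repeat. A minimal base-R sketch of the resampling scheme (toy `n`, not the actual data):

```r
# sketch of 3 x 10-fold cross-validation fold assignment
set.seed(123)
n       <- 50
k       <- 10
repeats <- 3

# one random fold label per observation, re-drawn for each repeat
folds <- replicate(repeats, sample(rep(1:k, length.out = n)), simplify = FALSE)

# each repeat splits all n observations into k folds of equal size (n/k = 5 here)
sizes_ok <- all(vapply(folds, function(f) all(table(f) == n / k), logical(1)))
```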

4.2.5 Prediction evaluation

In the following, we evaluate the prediction performance of the final model by looking at different performance measures obtained from the confusion matrix.

# load the final model
load("Data/xgb_final.RData")
# final model 
print(xgb_model_final)
eXtreme Gradient Boosting 

56660 samples
   21 predictor
    2 classes: 'No', 'Yes' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 50994, 50994, 50994, 50995, 50994, 50995, ... 
Resampling results:

  Accuracy   Kappa    
  0.7523473  0.5046975

Tuning parameter 'nrounds' was held constant at a value of 1000

Tuning parameter 'min_child_weight' was held constant at a value of 0

Tuning parameter 'subsample' was held constant at a value of 1
# make predictions on the test dataset
levels(diabetes_test$Diabetes_binary) <- c("No", "Yes")
pred <- predict(xgb_model_final, newdata = diabetes_test[,-1])

# create the confusion matrix
confusionMatrix(
   diabetes_test$Diabetes_binary, 
   pred, 
   positive = "Yes", 
   dnn = c("Prediction", "Truth")) 
Confusion Matrix and Statistics

          Truth
Prediction   No  Yes
       No  5081 1933
       Yes 1450 5568
                                         
               Accuracy : 0.7589         
                 95% CI : (0.7517, 0.766)
    No Information Rate : 0.5346         
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.5178         
                                         
 Mcnemar's Test P-Value : < 2.2e-16      
                                         
            Sensitivity : 0.7423         
            Specificity : 0.7780         
         Pos Pred Value : 0.7934         
         Neg Pred Value : 0.7244         
             Prevalence : 0.5346         
         Detection Rate : 0.3968         
   Detection Prevalence : 0.5001         
      Balanced Accuracy : 0.7601         
                                         
       'Positive' Class : Yes            
                                         
  • First, from the confusion matrix, we see that there are 5081 true negatives (TN), 1933 false negatives (FN), 1450 false positives (FP), and 5568 true positives (TP). Note that the reference category (i.e., ‘Positive’ class) is by default “No” (0), which refers to no diabetes; to simplify the interpretation, we manually set the ‘Positive’ class to “Yes”.

  • In the second part, we see that Accuracy = (TP + TN) / (TP + TN + FP + FN) = 0.76, with the 95% CI [0.7517, 0.7660]. The “no-information rate” is the proportion of the largest observed class (there were slightly more “Yes” cases, hence the rate is 0.53). A hypothesis test is also carried out to evaluate whether the overall accuracy is greater than the no-information rate. Given that the p-value is very small (\(p < .001\)), we reject the null hypothesis (that the accuracy equals the no-information rate) and conclude that the accuracy is greater than the no-information rate.

  • As explained earlier in Section 4.1.3, Kappa (Cohen’s Kappa) estimates the level of agreement. Following the suggestion of Landis and Koch (1977) that Kappa values between 0.41 and 0.60 indicate moderate agreement, we conclude that there is moderate agreement in our case, given that the Kappa value is 0.52.

  • McNemar’s test p-value is very small (\(p < .001\)), so we reject the null hypothesis and conclude that the observed and predicted proportions disagree. Again, this is not surprising, as a sizable share of cases is mis-predicted.

  • Sensitivity = TP / (TP + FN) = 0.74, meaning that if one has diabetes, there is a 74% probability that the model will detect it.

  • Specificity = TN / (TN + FP) = 0.78, meaning that if one does not have diabetes, there is a 78% probability that the model will detect this.

  • Positive predictive value (PPV) = TP / (TP + FP) = 0.79, meaning that if one is predicted to have diabetes, there is a 79% probability that the person actually has diabetes.

  • Negative predictive value (NPV) = TN / (TN + FN) = 0.72, meaning that if one is predicted not to have diabetes, there is a 72% probability that the person actually does not have diabetes.

  • Balanced accuracy = (sensitivity + specificity)/2 = 0.76. It is very similar to the overall accuracy, likely because we have a balanced dataset. In addition (not shown above), the F1 score = 2 \(\cdot\) (precision \(\cdot\) recall) / (precision + recall) = 0.77. Again, it is close to the accuracy, which makes sense since our dataset was balanced, with an equal number of cases in both classes (Yes/No diabetes).
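All of these quantities follow directly from the four cell counts of the confusion matrix; as a quick base-R sanity check (counts copied from the output above):

```r
# cell counts from the confusion matrix above (positive class = "Yes")
TN <- 5081; FN <- 1933; FP <- 1450; TP <- 5568
n  <- TN + FN + FP + TP

accuracy    <- (TP + TN) / n                    # 0.7589
nir         <- max(TN + FP, FN + TP) / n        # no-information rate: 0.5346
sensitivity <- TP / (TP + FN)                   # 0.7423
specificity <- TN / (TN + FP)                   # 0.7780
ppv         <- TP / (TP + FP)                   # 0.7934
npv         <- TN / (TN + FN)                   # 0.7244
bal_acc     <- (sensitivity + specificity) / 2  # 0.7601
f1          <- 2 * ppv * sensitivity / (ppv + sensitivity)  # 0.77

# Cohen's kappa: observed agreement corrected for chance agreement
p_chance <- ((TN + FN) * (TN + FP) + (FP + TP) * (FN + TP)) / n^2
kappa    <- (accuracy - p_chance) / (1 - p_chance)          # 0.5178

# McNemar's chi-squared statistic is driven by the discordant cells only
mcnemar <- (FN - FP)^2 / (FN + FP)
```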

4.2.6 Variable importance

Here we try out the global feature importance calculations that come with the caret package. Note that this kind of importance measure can be inconsistent: the order in which a model sees features can affect the importance estimates.

## check variable importance (top 15)
varImp(xgb_model_final, scale=TRUE)[["importance"]] %>% 
   rename("Importance" = "Overall") %>%
   # subset the top 15
   slice_max(Importance, n = 15) %>%
   # create ggplot for importance 
   ggplot(aes(
      x = reorder(rownames(.), Importance), 
      y = Importance, 
      fill = Importance)) + 
   geom_bar(stat="identity")+
   # add the importance value
   geom_text(
      aes(label = round(Importance, 2), col = Importance), 
      hjust = -0.2) +
   # specify the gradient color
   scale_fill_gradient(low = "grey40", high = "#28B463") + 
   scale_color_gradient(low = "grey40", high = "#28B463") +
   scale_y_continuous(limits = c(0, 104)) +
   coord_flip() + theme_minimal() +
   # edit the labels
   theme(legend.position = "none", axis.title.y = element_blank()) +
   labs(title = "XGBoosting - Variable Importance")

Figure 4.5: Variable importance based on XGboosting

Given the plot above, we can see that HighBP1, BMI, HighChol1, Age, and DiffWalk1 are relatively more important than the others. It is interesting to note that these correspond to the variables that are highly correlated with diabetes, as we saw earlier in the correlation plot in Section 2.2.3.

4.2.6.1 SHAP (SHapley Additive exPlanations)

The possible inconsistency described above motivates the use of SHapley Additive exPlanations (SHAP) values, since they guarantee consistency by averaging the marginal contribution of each feature across all permutations (i.e., all possible feature orderings). On top of that, the collective SHAP values show how much each predictor contributes, positively or negatively, to the target variable. This is like a variable importance plot, but one that can also show the positive or negative relationship of each variable with the target (see Figure 4.7). Lastly, another advantage of SHAP values is that each observation gets its own set of them, which greatly increases transparency: we can explain why a case receives its prediction and what each predictor contributes. Traditional variable importance plots only show results across the entire population, not for individual cases. This local interpretability enables us to pinpoint and contrast the impacts of the factors.
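As a toy illustration of the permutation-averaging idea (a two-feature additive model and a tiny hypothetical background dataset, not our actual XGBoost model):

```r
# toy model: f(x) = 2*x1 + 3*x2
f <- function(x1, x2) 2 * x1 + 3 * x2

# hypothetical background data used to "marginalize out" unknown features
bg <- data.frame(x1 = c(0, 1, 0, 1), x2 = c(0, 0, 1, 1))

# value of a coalition: predict with known features fixed, unknown set to their mean
v <- function(known, x) {
  x1 <- if ("x1" %in% known) x[["x1"]] else mean(bg$x1)
  x2 <- if ("x2" %in% known) x[["x2"]] else mean(bg$x2)
  f(x1, x2)
}

x <- c(x1 = 1, x2 = 1)  # instance to explain

# SHAP value = marginal contribution averaged over all feature orderings
phi1 <- mean(c(v("x1", x) - v(character(0), x),    # x1 added first
               v(c("x1", "x2"), x) - v("x2", x)))  # x1 added last
phi2 <- mean(c(v("x2", x) - v(character(0), x),
               v(c("x1", "x2"), x) - v("x1", x)))

# local accuracy: the contributions sum to f(x) minus the baseline prediction
baseline <- v(character(0), x)
phi1 + phi2 == f(1, 1) - baseline
```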

## compute the SHAP values
# download `shap.R` file from the Github repo
source_url("https://github.com/KyuriP/shap-values/blob/master/shap.R?raw=TRUE")

# define predictor matrix (x)
train_x <- model.matrix(Diabetes_binary ~ ., diabetes_train)[,-1] # exclude intercepts

# return a matrix of SHAP score and mean ranked score list
shap_results <- shap.score.rank(xgb_model_final$finalModel,
                                X_train = train_x,
                                shap_approx = F)

Here we create a variable importance plot based on the mean absolute SHAP values.

## variable importance plot (top 15) based on SHAP value rank
# extract the average SHAP scores
shap_importance <- data.frame(shap_results$mean_shap_score) %>% 
   rename("Importance" = "shap_results.mean_shap_score")

shap_importance %>%
   # subset the top 15 
   slice_max(Importance, n = 15) %>%
   # create ggplot for the shap scores (renamed to importance)
   ggplot(aes(
      x = reorder(rownames(.), Importance), 
      y = Importance, 
      fill = Importance)) + 
   geom_bar(stat="identity")+
   # add the importance value
   geom_text(aes(label = round(Importance, 2), col = Importance), hjust = -0.2) +
   # specify the gradient color
   scale_fill_gradient(low = "grey40", high = "#28B463") + 
   scale_color_gradient(low = "grey40", high = "#28B463") +
   scale_y_continuous(limits = c(0, 0.5)) +
   coord_flip() + theme_minimal() +
   # edit the labels
   theme(legend.position = "none", axis.title.y = element_blank()) +
   labs(title = "SHAP Feature Importance")

Figure 4.6: Variable importance based on SHAP values

The graph above is based on the average of the SHAP value magnitudes across the dataset. What stands out the most in this plot is the changed rank order. Based on the SHAP values, the most important variables are, in order: BMI > HighBP1 > Age > HighChol1, followed by GenHlth.


Rather than plotting a typical feature importance bar chart, we can also use a density scatter plot of SHAP values for each feature to identify how much impact each feature has on the model output for individuals in the validation dataset. Features are sorted by the sum of the SHAP value magnitudes across all samples.

## summary plot for SHAP values (top 15 variables)
shap_long <- shap.prep(shap = shap_results, X_train = train_x, top_n = 15)
plot.shap.summary(shap_long)

Figure 4.7: SHAP value summary plot

In the chart above, the x-axis represents the SHAP value, and the y-axis shows all the features. Each point on the chart is one SHAP value for one prediction and feature. Purple indicates a higher value of a feature and yellow a lower value. We can get a general sense of each feature's directional impact from the distribution of the dots; overlapping dots are jittered in the y-axis direction, so we also see the distribution of SHAP values per feature.
A positive SHAP value means a positive impact on the prediction, pushing the model toward "No" (reference class = "No": no diabetes); a negative SHAP value means a negative impact, pushing the model toward "Yes" (has diabetes).
For example:

  • Those with high BMI tend to have negative SHAP scores, meaning they are more likely to be predicted as “Yes”. That is, those with high BMI have a higher probability of having diabetes.

  • Those with high blood pressure, older age, or high cholesterol also have negative SHAP values, meaning they have a higher probability of having diabetes.


In the summary plot, we see indications of the relationship between the value of a feature and its impact on the prediction. But to see the exact form of the relationship, we have to look at SHAP dependence plots.

## SHAP contribution dependency plot (top 15)
xgb.plot.shap(train_x, model = xgb_model_final$finalModel, n_col = 3, top_n = 15)

Figure 4.8: SHAP dependence plot

  • Looking at BMI, we can see how an increase in BMI is associated with a decrease in SHAP values (w.r.t. not having diabetes). It is interesting to note that around a BMI of 60 the curve starts to increase again, reflecting a non-linear relationship between BMI and (not) having diabetes.

  • For high blood pressure (HighBP) and high cholesterol (HighChol), we can see that an increase in blood pressure or cholesterol leads to a decrease in SHAP values, meaning that the higher the blood pressure or cholesterol level, the more likely one is predicted to have diabetes. The same pattern is observed for GenHlth, DiffWalk, HeartDiseaseorAttack, and CholCheck.

  • When it comes to sex (1 = male, 0 = female), males (Sex = 1) are associated with lower SHAP values than females, indicating that males are more likely to be predicted to have diabetes.


5 Overall Conclusion

In this last part, we compare the performance of the Binary Logistic Regression model and the XGBoosting model based on overall Accuracy/Kappa, the ROC curve, and calibration. We also summarize some evaluation metrics shown previously in each of the confusion matrices. In addition, we revisit the important variables, trying to get an insight into the relevant factors that are associated with developing diabetes.

## Accuracy and Kappa comparison
# Accuracy per model
model_summary_acc <- data.frame(method="Logistic Regression", Accuracy = model$resample$Accuracy) %>%
   rbind(data.frame(method = "XGBoosting", Accuracy = xgb_model_final$resample$Accuracy
))

# Kappa values per model
model_summary_kappa <- data.frame(method="Logistic Regression", Kappa = model$resample$Kappa) %>%
   rbind(data.frame(method = "XGBoosting", Kappa = xgb_model_final$resample$Kappa
))

# create a boxplot for accuracy  
acc <- ggplot(model_summary_acc, aes(x = method, y = Accuracy)) + 
  geom_boxplot(outlier.shape = NA, aes(fill = method, alpha = 0.3)) +
  geom_jitter(alpha = 0.2, position = position_jitter(w = 0.05), col = "grey20") +
  theme_classic() + labs(y = "Accuracy", x="") +
  theme(legend.position = "none") 

# create boxplot for kappa
kappa <- ggplot(model_summary_kappa, aes(x = method, y = Kappa)) + 
  geom_boxplot(outlier.shape = NA, aes(fill = method, alpha = 0.3)) +
  geom_jitter(alpha = 0.2, position = position_jitter(w = 0.05), col = "grey20") +
  theme_classic() + labs(y = "Kappa", x ="") +
  theme(legend.position = "none") 

# put the plots in 1 by 2 grid
plot5.1 <- ggarrange(plotlist = list(acc, kappa), ncol = 2)

# add the title of figure
annotate_figure(
   plot5.1, 
   top = text_grob("Model Resample Performance Comparison", 
   size = 13))

Figure 5.1: Overall accuracy and kappa comparison

Figure 5.1 shows that XGBoosting has higher accuracy and Kappa than the logistic regression. Also, the spread of the two measures is relatively larger for the logistic regression, meaning there was more variance in its resampled performance than in the XGboosting model's.

## ROC comparison
# predicted probability for each model
pred_lr <- predict(model, newdata = diabetes_2_test, type = "prob")$Diabetic
pred_boost <- predict(xgb_model_final, newdata = diabetes_test[,-1], type = "prob")

# roc for each model
roc_lr <- roc(diabetes_2_test$Diabetes_binary, pred_lr)
roc_boost <- roc(diabetes_test$Diabetes_binary, pred_boost$No)

# auc for each model
auc_lr <- as.numeric(roc_lr$auc)
auc_boost <- as.numeric(roc_boost$auc)

# plot the ROC for each model
ggroc(list(roc_lr, roc_boost), legacy.axes = TRUE, aes=c("color")) +
  # add the diagonal line
  geom_segment(aes(x = 0, xend = 1, y = 0, yend = 1), color="grey", linetype="dashed", lwd=0.3) +
  # specify the labels
  scale_color_discrete(labels = c("logistic regression", "XGboosting")) + 
  theme_classic() + labs(title = "ROC for Logistic Regression and XGboosting", color = "models") +
  # add AUC values
  annotate("text", x = 0.9, y = .1, 
           label = paste("AUC_lr:", round(auc_lr,2), "\nAUC_xgboost:", round(auc_boost, 2)))

Figure 5.2: ROC comparison

The AUC of XGBoosting (0.84) is slightly higher than that of the logistic regression (0.83). In the ROC plot, the curve for the XGBoosting model also lies above the curve for the logistic regression model, although the difference is trivial and the two curves almost overlap.
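As a side note, the AUC has a useful probabilistic reading: it is the probability that a randomly chosen diabetic case receives a higher predicted probability than a randomly chosen non-diabetic one. A small base-R sketch of that equivalence using toy scores (hypothetical values, not our fitted models):

```r
# toy predicted probabilities and true labels (1 = diabetic)
score <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2)
label <- c(1,   1,   0,   1,   0,   1,   0)

# AUC as the fraction of (positive, negative) pairs that are ranked correctly
pos <- score[label == 1]
neg <- score[label == 0]
pairs <- expand.grid(p = pos, n = neg)
auc_pairs <- mean(pairs$p > pairs$n)   # ties would count as 0.5

# the same value via the Mann-Whitney rank-sum formula
r <- rank(score)
auc_rank <- (sum(r[label == 1]) - length(pos) * (length(pos) + 1) / 2) /
   (length(pos) * length(neg))
```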

## Calibration comparison
# obtain the calibration statistics
calibration_xgboost <- calibration(factor(diabetes_test[,1]) ~ pred_boost$Yes, class = "Yes")$data
calibration_lr <- calibration(true ~ pred_lr, class = "Diabetic")$data

# plot the calibration plot
ggplot()  +
   geom_line(data = calibration_lr, aes(midpoint, Percent, color = "lr")) +
   geom_point(data = calibration_lr, aes(midpoint, Percent,  color = "lr"), size = 2) +
   geom_line(data = calibration_xgboost, aes(midpoint, Percent, color = "boost"))+
   geom_point(data = calibration_xgboost, aes(midpoint, Percent, color = "boost"), size = 2) +
   geom_line(aes(c(0, 100), c(0, 100)), linetype = 2, color = alpha('grey50', 0.5)) +
   scale_color_manual(name = "Model", 
                     values = c( "lr" = "#F8766D", "boost" = "#00BFC4"), 
                     labels = c("logistic regression", "xgboost")) + 
   labs(title="Calibration plot for predicted probabilities by logistic regression and xbgoost", 
        x ="Bin Midpoint", color = "model") +
   theme_minimal() 

Figure 5.3: Calibration comparison

The calibration plot shows that both models are almost equally well calibrated, as both lines follow the diagonal closely. As with the ROC, the difference between the models is minimal, yet the XGboosting model is slightly better than the logistic regression model (which deviates from the diagonal more around bin midpoints of 50 and 80).
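The mechanics behind such a plot are simple: predictions are grouped into probability bins and, per bin, the observed event rate is compared to the bin midpoint. A minimal base-R sketch with toy predictions (hypothetical values, not our fitted models):

```r
# toy predicted probabilities and observed outcomes (1 = diabetic)
p <- c(0.05, 0.15, 0.12, 0.45, 0.55, 0.48, 0.85, 0.95, 0.90, 0.88)
y <- c(0,    0,    1,    0,    1,    0,    1,    1,    1,    0)

# bin the predictions and compute the observed percentage per bin
breaks <- seq(0, 1, by = 0.2)
bin  <- cut(p, breaks, include.lowest = TRUE)
mids <- 100 * (head(breaks, -1) + diff(breaks) / 2)  # bin midpoints (x-axis)
obs  <- 100 * tapply(y, bin, mean)                   # observed percent (y-axis)

# a well-calibrated model keeps `obs` close to `mids` in every non-empty bin
data.frame(midpoint = mids, percent = as.numeric(obs))
```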

## estimate the decision boundary using the continuous variables (Age & BMI)
## logistic regression
# estimate the logistic model
lr_mod <- glm(Diabetes_binary ~ Age + BMI, data = diabetes_train, family="binomial")
# predict classes as per predicted probabilities
lr_prob <- predict(lr_mod, diabetes_test[1:9801,], type = "response")
# ref.cat = nondiabetes
lr_classes <- ifelse(lr_prob > 0.5, 1, 0)
lr_test <- data.frame(diabetes_test[1:9801, c("Diabetes_binary","Age","BMI")], group=lr_classes)

# create prediction grid (use min/max of each variable)
lgrid <- expand.grid(Age=seq(min(diabetes_test$Age), max(diabetes_test$Age), by=0.1), 
                     BMI=seq(min(diabetes_test$BMI), max(diabetes_test$BMI), by=0.1))

# compute predicted grid for logistic reg.
lr_PredGrid <- ifelse(predict(lr_mod, newdata=lgrid, type="response") > 0.5, 0, 1)
lrPredGrid_prob <- predict(lr_mod, newdata=lgrid, type = 'response')

#plot decision boundary for logistic reg.
plot_lr <- ggplot(data = lgrid, mapping = aes(x = Age, y = BMI))+
   geom_raster(data = lgrid, mapping = aes(fill = lr_PredGrid), 
               alpha = lrPredGrid_prob, interpolate = TRUE)+
   scale_fill_gradientn(colours = c("#E75B64", "#278B9A"), breaks = c(1, 2), 
                        labels = c("Nondiabetic", "Diabetic")) +
   geom_point(data = lr_test, aes(x=Age, y=BMI, shape = as.factor(Diabetes_binary)), 
              size=1, color="darkslategray")+
   labs(title = "Logistic regression", shape ="Diabetes")  +
   scale_shape_manual(values = c(16, 17))+
   guides(fill = guide_legend(title="Diabetes prediction")) +
   theme_classic()

## XGboosting
# obtain the xgboost trained model
# boosting <- train(Diabetes_binary ~ Age + BMI, 
#                   data = diabetes_train,
#                   method = "xgbTree")
#
# save it to an RData as it takes some time to run
#save(boosting, file = "Data/boosting.RData")

# load the boosting model
load("Data/boosting.RData")
# predict classes and obtain corresponding predicted probabilities
classes_boosting <- predict(boosting, newdata = diabetes_test[1:9801,-1])
probs_boosting <- predict(boosting, newdata = diabetes_test[1:9801,-1], type = "prob")
test_boost <- data.frame(diabetes_test[1:9801, c("Diabetes_binary","Age","BMI")], group=classes_boosting)

# compute predicted grid for xgboost
boost_PredGrid <- as.numeric(predict(boosting, newdata=lgrid))
boostPredGrid_prob <-predict(boosting, newdata=lgrid, type = 'prob')[,1]

# plot decision boundary for xgboost
plot_boost <- ggplot(data=lgrid, mapping = aes(x=Age, y=BMI))+
   geom_raster(data = lgrid, mapping = aes(fill=boost_PredGrid), 
               alpha = boostPredGrid_prob, interpolate = TRUE)+
   geom_point(data = test_boost, aes(x=Age, y=BMI, shape=as.factor(Diabetes_binary)), 
              size=1, color="darkslategray") + 
   labs(title = "XGboosting", shape ="Diabetes") +
   scale_shape_manual(values = c(16, 17))+
   scale_fill_gradientn(colours=c("#278B9A","#E75B64"), breaks = c(1, 2), 
                        labels = c("Nondiabetic","Diabetic"))+
   guides(fill = guide_legend(title="Diabetes prediction")) +
   theme_classic()
   
# combine the plots and annotate
boundaryplot <- ggarrange(plot_boost, plot_lr, common.legend = TRUE, legend="bottom")
annotate_figure(boundaryplot, top = text_grob("Decision boundary plots", size = 16))

Figure 5.4: Decision boundary comparison

To get better intuition on how these two models predict, we plotted the decision boundary for both models using the continuous variables Age and BMI. We expected the logistic regression model to have a somewhat linear boundary, while the xgboost model would show a more flexible one. As shown in Figure 5.4, the decision boundary of the logistic regression is indeed more or less linear, while the XGboosting boundary is very curvy and divides the feature space flexibly, in accordance with our expectation. Note that this plot is made for illustrative purposes; the models used here to estimate the boundaries are reduced versions of the final models, not the final models themselves.

Table 5.1: Comparison on evaluation metrics

Model                 Precision   F1 score   Balanced accuracy
logistic regression   0.746       0.758      0.754
XGboosting            0.793       0.774      0.760

Table 5.2: Comparison on important variables

Rank   Logistic regression      XGboosting
1      BMI                      BMI
2      GenHlth_4                HighBP1
3      Age                      Age
4      GenHlth_5                HighChol1
5      GenHlth_3                GenHlth4
6      HighBP_1                 GenHlth3
7      HighChol_1               GenHlth5
8      GenHlth_2                DiffWalk1
9      CholCheck_1              Sex1
10     Sex_1                    Income8
11     HvyAlcoholConsump_1      HeartDiseaseorAttack1
12     Income_8                 GenHlth2
13     HeartDiseaseorAttack_1   HvyAlcoholConsump1
14     Income_7                 CholCheck1
15     Income_6                 Education6

All things considered, we conclude that XGboosting, the more complex of the two models, performs better than the comparatively simple Binary Logistic Regression model. Table 5.1 summarizes some of the evaluation metrics and again supports the superior performance of the XGboosting model, which has higher precision, F1 score, and balanced accuracy.

Given that we did not find any dramatic differences between the two models, we assume this is due to the good fit of the logistic regression. More complex models (e.g., XGboosting) often work much better when the base model (e.g., logistic regression) severely under- or overfits the data, which is not the case here. In addition, since the overall accuracy is consistently estimated at around 75% by both models, we think this is essentially the limit of what can be inferred from this dataset. We could partly expect such mediocre performance from the correlations observed in Section 2.2.3, where the overall correlations between the predictors and diabetes were not particularly high.

With regard to our second aim (i.e., gaining insight into risk factors of diabetes), we revisit the important variables from each model. Table 5.2 shows the top 15 variables; both models indicate that BMI, age, general health, blood pressure, and cholesterol level are the most important variables for predicting diabetes. This corresponds to the relatively highly correlated variables shown in Section 2.2.3, which naturally makes sense. Together with the SHAP dependence plots seen earlier, we conclude that high BMI, high blood pressure, older age, poor general health, and a high cholesterol level are crucial risk factors for diabetes.


6 References

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning: With Applications in R. New York: Springer.
Kirasich, Kaitlin, Trace Smith, and Bivin Sadler. 2018. “Random Forest Vs Logistic Regression: Binary Classification for Heterogeneous Datasets.” SMU Data Science Review 1 (3). https://scholar.smu.edu/datasciencereview/vol1/iss3/9.
Landis, J. Richard, and Gary G. Koch. 1977. “The Measurement of Observer Agreement for Categorical Data.” Biometrics 33 (1): 159–74. https://doi.org/10.2307/2529310.
Lundberg, Scott M, and Su-In Lee. 2017. “A Unified Approach to Interpreting Model Predictions.” In Advances in Neural Information Processing Systems. Vol. 30. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2017/hash/8a20a8621978632d76c43dfd28b67767-Abstract.html.
Martin-Barragan, Belen, Rosa Lillo, and Juan Romo. 2014. “Interpretable Support Vector Machines for Functional Data.” European Journal of Operational Research 232 (1): 146–55. https://doi.org/10.1016/j.ejor.2012.08.017.
Russell, Stuart J., Peter Norvig, and Ernest Davis. 2010. Artificial Intelligence: A Modern Approach. 3rd ed. Prentice Hall Series in Artificial Intelligence. Upper Saddle River: Prentice Hall.
Sarker, Iqbal H. 2021. “Deep Learning: A Comprehensive Overview on Techniques, Taxonomy, Applications and Research Directions.” SN Computer Science 2 (6): 420. https://doi.org/10.1007/s42979-021-00815-1.

## Query and print information about the current R session
sessionInfo()
R version 4.2.0 (2022-04-22)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 13.0

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] pander_0.6.5       pROC_1.18.0        plotROC_2.3.0      lm.beta_1.6-2     
 [5] RColorBrewer_1.1-3 xgboost_1.6.0.1    devtools_2.4.4     usethis_2.1.6     
 [9] forcats_0.5.1      stringr_1.4.1      dplyr_1.0.9        readr_2.1.2       
[13] tidyr_1.2.0        tibble_3.1.7       tidyverse_1.3.2    gtsummary_1.6.2   
[17] kableExtra_1.3.4   knitr_1.40         caret_6.0-93       lattice_0.20-45   
[21] ggpubr_0.4.0       purrr_0.3.4        ggplot2_3.3.6      magrittr_2.0.3    

loaded via a namespace (and not attached):
  [1] readxl_1.4.0         backports_1.4.1      systemfonts_1.0.4   
  [4] plyr_1.8.7           splines_4.2.0        listenv_0.8.0       
  [7] digest_0.6.29        foreach_1.5.2        htmltools_0.5.2     
 [10] fansi_1.0.3          memoise_2.0.1        googlesheets4_1.0.0 
 [13] tzdb_0.3.0           remotes_2.4.2        recipes_1.0.1       
 [16] globals_0.15.1       modelr_0.1.8         gower_1.0.0         
 [19] svglite_2.1.0        hardhat_1.2.0        prettyunits_1.1.1   
 [22] colorspace_2.0-3     rvest_1.0.2          haven_2.5.0         
 [25] xfun_0.31            callr_3.7.1          crayon_1.5.1        
 [28] jsonlite_1.8.0       survival_3.3-1       iterators_1.0.14    
 [31] glue_1.6.2           gtable_0.3.0         gargle_1.2.0        
 [34] ipred_0.9-13         webshot_0.5.3        car_3.1-0           
 [37] pkgbuild_1.3.1       future.apply_1.9.0   abind_1.4-5         
 [40] scales_1.2.0         DBI_1.1.3            rstatix_0.7.0       
 [43] miniUI_0.1.1.1       Rcpp_1.0.9           viridisLite_0.4.0   
 [46] xtable_1.8-4         proxy_0.4-27         stats4_4.2.0        
 [49] lava_1.6.10          prodlim_2019.11.13   profvis_0.3.7       
 [52] htmlwidgets_1.5.4    httr_1.4.3           ellipsis_0.3.2      
 [55] farver_2.1.1         urlchecker_1.0.1     pkgconfig_2.0.3     
 [58] nnet_7.3-17          sass_0.4.1           dbplyr_2.2.1        
 [61] utf8_1.2.2           labeling_0.4.2       tidyselect_1.1.2    
 [64] rlang_1.0.5          reshape2_1.4.4       later_1.3.0         
 [67] munsell_0.5.0        cellranger_1.1.0     tools_4.2.0         
 [70] cachem_1.0.6         cli_3.3.0            generics_0.1.3      
 [73] broom_1.0.0          evaluate_0.15        fastmap_1.1.0       
 [76] yaml_2.3.5           ModelMetrics_1.2.2.2 processx_3.7.0      
 [79] fs_1.5.2             future_1.27.0        nlme_3.1-158        
 [82] mime_0.12            xml2_1.3.3           compiler_4.2.0      
 [85] rstudioapi_0.13      e1071_1.7-11         ggsignif_0.6.3      
 [88] gt_0.7.0             reprex_2.0.2         broom.helpers_1.9.0 
 [91] bslib_0.3.1          stringi_1.7.8        highr_0.9           
 [94] ps_1.7.1             Matrix_1.4-1         vctrs_0.4.1         
 [97] pillar_1.7.0         lifecycle_1.0.1      jquerylib_0.1.4     
[100] cowplot_1.1.1        data.table_1.14.2    httpuv_1.6.5        
[103] R6_2.5.1             bookdown_0.27        promises_1.2.0.1    
[106] gridExtra_2.3        parallelly_1.32.1    sessioninfo_1.2.2   
[109] codetools_0.2-18     fastDummies_1.6.3    MASS_7.3-58.1       
[112] assertthat_0.2.1     pkgload_1.3.0        withr_2.5.0         
[115] mgcv_1.8-40          parallel_4.2.0       hms_1.1.1           
[118] grid_4.2.0           rpart_4.1.16         timeDate_4021.104   
[121] class_7.3-20         rmarkdown_2.16       carData_3.0-5       
[124] googledrive_2.0.0    shiny_1.7.2          lubridate_1.8.0